SparkR basics

This notebook shows a few very simple steps with SparkR.


In [1]:
# Load the SparkR package.
# Loading it will print a few warnings about functions that the package masks
library(SparkR)


Attaching package: 'SparkR'

The following objects are masked from 'package:stats':

    cov, filter, lag, na.omit, predict, sd, var

The following objects are masked from 'package:base':

    colnames, colnames<-, intersect, rank, rbind, sample, subset,
    summary, table, transform


In [4]:
# Unlike the Python & Scala kernels, the IRkernel does not give us an automatically
# created Spark Context; we need to initialize one ourselves. That takes a few moments.
sc <- sparkR.init(master = "local[*]")


Re-using existing Spark Context. Please stop SparkR with sparkR.stop() or restart R to create a new Spark Context

In [5]:
# Once we have it, we can also obtain an SQL context
sqlContext <- sparkRSQL.init(sc)

In [6]:
# Do something to prove it works

# Load one of the standard datasets that come pre-packaged with R
data(iris)

# Turn the dataset into a SparkR DataFrame
df <- createDataFrame(sqlContext, iris)

# Inspect it
head(filter(df, df$Petal_Width > 0.2))


Warning message:
In FUN(X[[i]], ...): Use Sepal_Length instead of Sepal.Length as column name
Warning message:
In FUN(X[[i]], ...): Use Sepal_Width instead of Sepal.Width as column name
Warning message:
In FUN(X[[i]], ...): Use Petal_Length instead of Petal.Length as column name
Warning message:
In FUN(X[[i]], ...): Use Petal_Width instead of Petal.Width as column name
Out[6]:
  Sepal_Length Sepal_Width Petal_Length Petal_Width Species
1          5.4         3.9          1.7         0.4  setosa
2          4.6         3.4          1.4         0.3  setosa
3          5.7         4.4          1.5         0.4  setosa
4          5.4         3.9          1.3         0.4  setosa
5          5.1         3.5          1.4         0.3  setosa
6          5.7         3.8          1.7         0.3  setosa
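
Note that Spark replaced the dots in the iris column names with underscores, because dots are not valid in Spark SQL column names. Since we have an SQL context, we can also query the DataFrame with plain SQL. A minimal sketch using the SparkR 1.x API already loaded above (the temporary table name "iris" is our own choice):

In [ ]:
# Register the DataFrame as a temporary table so Spark SQL queries can see it
registerTempTable(df, "iris")

# Run a plain SQL aggregation through the SQL context and fetch the result
head(sql(sqlContext, "SELECT Species, COUNT(*) AS n FROM iris GROUP BY Species"))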

In [ ]:
# sc is an existing SparkContext.
# hiveContext <- sparkRHive.init(sc)
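
When finished, the context can be stopped so that a new one can be created later, as the message above suggests. Uncomment to actually stop it:

In [ ]:
# Stop the Spark Context; afterwards a new one can be created with sparkR.init()
# sparkR.stop()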
